[SC-8983] Remove ydata-profiling library dependency #333

Merged
AnilSorathiya merged 7 commits into main from anilsorathiya/sc-8983/remove-ydata-profiling-library-dependency
Mar 11, 2025

Conversation

@AnilSorathiya
Contributor

Internal Notes for Reviewers

Little of the codebase depends on the ydata-profiling library, and its new version displays a message that can confuse users of the vm-library.

@AnilSorathiya added the "internal" label (Not to be externalized in the release notes) on Mar 9, 2025
@github-actions
Contributor

PR Summary

This pull request introduces several changes to the codebase:

  1. Refactor Data Type Inference: The infer_datatypes function has been refactored and moved to validmind/utils.py. This function now includes enhanced logic for determining if a column is text-based using heuristics and pattern matching. It also provides detailed type information, including subtypes for numeric and text data.

  2. Remove Unused Dependencies: The dependencies on ydata_profiling and related imports have been removed from the codebase. This includes the removal of ProfilingTypeSet and Settings from ydata_profiling in the DatasetDescription.py and Skewness.py files.

  3. Test Data Update: The test data in test_DatasetDescription.py has been updated to include more realistic text examples, such as email addresses and longer text strings.

  4. Dependency Cleanup: The poetry.lock and pyproject.toml files have been updated to remove the ydata-profiling dependency, along with packages that are no longer needed: dacite, htmlmin, imagehash, multimethod, phik, pywavelets, typeguard, visions, and wordcloud.

These changes aim to streamline the codebase by removing unnecessary dependencies and improving the robustness of data type inference.
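
As a rough illustration of the heuristic-based inference described in item 1, the sketch below classifies columns as Null, Boolean, Numeric (with an Integer or Float subtype), Text, or Categorical. It is not the code from validmind/utils.py: the function names, the thresholds (`min_avg_length`, `min_unique_ratio`), the email pattern, and the shape of the returned dictionary are all assumptions made for illustration. Columns are plain Python lists rather than pandas Series so the example has no dependencies, and only `None` is treated as missing.

```python
import re

# Hypothetical pattern: the PR mentions pattern matching but not the exact patterns.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_text(values, min_avg_length=20, min_unique_ratio=0.9):
    """Guess whether string values are free text rather than categories.

    The thresholds are illustrative assumptions, not values taken from the PR.
    """
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return False
    # Pattern match: a column made up mostly of email addresses counts as text.
    email_share = sum(bool(EMAIL_RE.match(s)) for s in strings) / len(strings)
    if email_share > 0.5:
        return True
    # Heuristic: long, mostly unique strings are likely free text.
    avg_length = sum(len(s) for s in strings) / len(strings)
    unique_ratio = len(set(strings)) / len(strings)
    return avg_length >= min_avg_length and unique_ratio >= min_unique_ratio


def infer_types_sketch(columns):
    """Map each column name to a dict with a 'type' and, for numbers, a 'subtype'."""
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present:
            # All-null (or empty) columns get their own type.
            result[name] = {"type": "Null"}
        elif all(isinstance(v, bool) for v in present):
            result[name] = {"type": "Boolean"}
        elif all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            is_integer = all(isinstance(v, int) for v in present)
            result[name] = {
                "type": "Numeric",
                "subtype": "Integer" if is_integer else "Float",
            }
        elif looks_like_text(present):
            result[name] = {"type": "Text"}
        else:
            result[name] = {"type": "Categorical"}
    return result
```

Calling `infer_types_sketch({"email": ["a@example.com", "b@example.org"], "grade": ["A", "B", "A"]})` under these assumptions labels `email` as Text and `grade` as Categorical. Checking for missing values first keeps the all-null edge case from falling through to the string heuristics.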

Test Suggestions

  • Test the infer_datatypes function with a variety of DataFrame inputs to ensure it correctly identifies column types, including edge cases like all-null columns.
  • Verify that the is_text_column function accurately classifies text columns using different patterns and thresholds.
  • Run unit tests on the Skewness function to ensure it correctly uses the refactored infer_datatypes function.
  • Check that the removal of ydata_profiling and related imports does not affect the functionality of existing features.
  • Ensure that the updated test data in test_DatasetDescription.py produces the expected results.
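
One way to approach the fourth suggestion is to scan the source for leftover imports of the removed package before running the feature tests. The sketch below does this with the standard-library `ast` module. The helper name `find_package_imports` is hypothetical and not part of the vm-library.

```python
import ast


def find_package_imports(source, package="ydata_profiling"):
    """Return sorted (line_number, module_name) pairs for imports of `package`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to local modules.
            modules = [node.module] if node.level == 0 and node.module else []
        else:
            continue
        for module in modules:
            # Match the package itself and any of its submodules.
            if module == package or module.startswith(package + "."):
                hits.append((node.lineno, module))
    return sorted(hits)
```

In practice this could be run over each `.py` file in the repository, for example by reading files found with `pathlib.Path.rglob("*.py")`, and asserting that the combined result is empty. Parsing with `ast` rather than searching for the string avoids false positives from comments and docstrings that merely mention the package.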

Contributor

@johnwalz97 johnwalz97 left a comment

beautiful 😍...

my only suggestion is maybe to break the utils.py file into separate files (dataset_utils.py, model_utils.py, etc.) since it's getting quite long

Contributor

@cachafla cachafla left a comment

🫡

@AnilSorathiya AnilSorathiya merged commit f296019 into main Mar 11, 2025
6 checks passed
Labels

internal Not to be externalized in the release notes

3 participants